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Abstract 

We outline the super-resolution reconstruction problem posed as a maximization 
of probability. We then introduce an interpolation method based on polygonal 
pixel overlap, express it as a linear operator, and use it to improve reconstruction. 
Polygon interpolation outperforms the simpler bilinear interpolation operator and, 
unlike Gaussian modeling of pixels, requires no parameter estimation. A free 
software implementation that reproduces the results shown is provided. 



1 Introduction 

Super-resolution imaging is the process whereby several low-resolution photographs of an object 
are combined to form a single high-resolution estimation. The problem, as most commonly viewed 
today, was outlined by Cheeseman et al. [3], and has as its solution a maximum a posteriori point 
estimate. Given a number of low-resolution images, concatenated to form the vector b, as well as 
accompanying camera alignment parameters, c, the super-resolution problem is the estimation of a 
high-resolution image, x, such that the posterior 

P(x|b,c) 

is maximized. The problem becomes more tractable if viewed, via Bayes's theorem, as the maxim- 
ization of 

We can also think of it as the maximum likelihood estimator, regularized by the term P(x|c), which 
answers the question: "Given a high-resolution image of a scene and camera positions, how would 
the obtained low-resolution images look?". 

We compute the solution to the the above problem, after assuming that a linear relationship, Ax = b, 
exists between the high-resolution reconstruction x and the low-resolution input images b. We 
explore the structure of A, which incorporates aspects such as geometric distortion and interpolation, 
and show that an interpolation operator based on pixel overlap has desirable properties. 



2 Image formation model 

Image formation is the process whereby light, reflected from a scene, travels through an optical 
system and causes accumulation of charge in a photosensitive sensor element. The charge values 
are read out, amplified, discretised and possibly processed before being stored as image intensity 
values. 

Super-resolution relies on slight shifts in camera (or object) positions between several input frames 
to provide high-frequency information lost during sampling. Images are registered, preferably to 
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sub-pixel accuracy, before one of several methods (averaging or "stacking", map and deblur [8], 
pan- sharpening [11, 7], supervised super-resolution [18], etc.) is applied to restore high-frequency 
detail. 

The image acquisition process is often represented as the simplified model 

b« =S;(/i(T(x)))+77 (2) 

where b^) is the z-th low-resolution digital image obtained, x is a high-resolution representation 
of scene radiance, T is a geometric transformation dependent on camera position, h is the camera 
point-spread function, S I is the downsampling operator and 77 is normally distributed noise. 

This model makes two main assumptions: that low-resolution images can be reproduced from the 
high-resolution image, and that any additive noise is linear, zero-mean and Gaussian. Since, in 
reality, the camera generates low-resolution images based on the scene itself, the first assumption 
holds only if the high-resolution image is a fairly good representation of the true scene radiance. As 
for the second, noise sources vary according to exposure level [4], but among the different prominent 
sources (readout, shot and fixed pattern [10]) only fixed-pattern noise has non-zero mean [9], and 
can be removed using flat-field correction [5] without adversely affecting reconstruction. 

Estimating the high-resolution image from a set of low-resolution images only becomes tractable 
once further assumptions are made. In our case, we make the fairly weak assumption that the linear 
relationship 

b W =A^X + T7 

holds, i.e., that the intensity of a pixel in any of the low-resolution images may be obtained as a 
linear combination of corresponding neighboring pixels in the high-resolution image. 



3 The maximum likelihood solution 



Experiments show [ , 3] that the per-pixel intensity probability, P(&|x, c) with b G b, can be ap- 
proximated as a Gaussian distribution, 

P(6|x,c) = N{b,* h ) = -1 c -( 6 - 5 ) a /( 2 ^), 

with the x-dependence through b = Ax. If we assume independence of pixels across different low- 
resolution images (reasonable, given that multiple factors such as noise and registration errors are 
at play), as well as independence over neighboring pixels (since neighborhood influence is highly 
localised), the distribution becomes 



P(b|x,c) = — exp (-i(b - AxfE-^b - Ax)) , 

with b set to Ax, according to our assumed model. Maximizing the log probability under the 
assumptions of spherical covariance E = a 2 1, and prior P(x|c) Gaussian with zero mean, yields 



Ax T x 



arg maxP(b|x, c) = x = argmin ||b — Ax\\ 

X x L 

The form of ( ) is well known as the regularised or damped solution to the least-squares problem 

Ax = b. 



(3) 



Yet, the assumption of a zero-mean prior is known to be invalid — the output image is unlikely to be 
black. We therefore transform the problem to seek a solution around an arbitrary prior estimate, x : 



b - Ax 



£x — arg min 

<5x 



A5x-b 



+ A(<5x) T (<5x) 



(4) 



x = x + £x. (5) 

Here, the difference between the prior estimate, x , and the reconstructed image, x, is expected to 
be distributed around zero. 
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4 The image formation matrix, A 



In this section, we discuss the structure of the image formation matrix that approximates (2) 
by encapsulating three processes: geometric transformation, the effect of the point- spread function 
as well as down-sampling. More concretely, what do the values in represent, and how can they 
be computed? 

First, consider the dimensions of the vectors and matrices involved. The noiseless model for a single 
image, A^x = bW, includes the i-th low-resolution output image, aPxQ image unpacked 
into a vector. For notational convenience, assign M = PQ. The vector x is the high-resolution 
zP x zQ image, also unpacked. The zoom factor, z with z > 1, represents the increase in resolution; 
e.g., if z = 2 then the high-resolution image has twice as many pixels as the low-resolution image 
along each axis. The dimensionality of x is N = z 2 M. Given the dimensions of b^ and x, the 
shape of the matrix A has to be M x N = M x z 2 M. 

Since M < N, the system A^x = b^ ^ is underdetermined, but when combining all k 
transformation-interpolation matrices, 



A = 



A (o) 
A (i) 

A (k) 



and b = 



b (°) 
b (D 



the resulting system Ax = b is overdetermined as long as k > z 2 . Note that, even when combining 
a large number of images, there is no guarantee that each additional image provides independent 
information (for example, the case in which k identical frames are combined). 

Returning to A^\ each row represents weights for values in x, combined to form a single pixel of 
bW, such that the ra-th pixel is 

~~ • (6) 



6$? = E^?, 



can be approximated as 



A (i) _ J»W(7 



or 



A {i) = CT (i) 



where C represents the effect of the point-spread function as a convolution, while represents 
geometric transformation and down- sampling. While not explicitly shown in the above description, 
we choose to express the geometric transformation itself as a linear transformation, p' = Hp, 
where p is a homogeneous pixel coordinate. Non-linear transformation models are just as viable, 
but for simplicity we choose to estimate a homeography H^q during registration. That homeography 
transforms the low-resolution image i onto the reference frame i = (which is typically the first 
image in the sequence, but can be chosen arbitrarily). 

The application order of the operators T"W and C is not arbitrary, and either choice presents diffi- 
culties. Importantly, transforms a high-resolution image into a low-resolution image. Therefore, 
if the convolution C is applied first, forshortening due to the geometric transformation may lead to 
certain areas of the scene being more densely sampled than others. If T is applied first, i.e. if the 
convolution operator acts on the resulting low-resolution image, some samples in the high-resolution 
image may not be taken into consideration at all. 

Capel in [ ] addresses this problem by designing the matrix A as a convolution kernel (representing 
the camera point- spread function), transformed by the known geometric transformation H to modify 
its shape and convolution path. One could imagine this warped convolution "fetching" all high- 
resolution pixels that contribute to a specific low-resolution pixel. The approach works well, but 
introduces some challenges of its own: what, e.g., should the shape and size of the convolution 
kernel be? The camera point spread function is well modelled as a Gaussian kernel, but even so 
the variance, <r 2 , is unknown. How should the transformed kernel be represented and applied? 
Capel models the kernel as a piecewise bilinear surface to allow transformation and integration. The 
variance parameter may be estimated by repeatedly making reconstructions while varying a 2 until 
the result shows little oscillatory behaviour. 
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(a) Input image (one of ten), upscaled using sine 
(Lanczos) interpolation. 



(b) Reconstruction at 5 x zoom with bilinear interpola- 
tion. A high-resolution reconstruction is made, but the 
result is oscillatory due to the bilinear operator's small 
footprint. 





(c) Reconstruction at 1.8 x zoom with bilinear interpol- (d) Reconstruction at 5 x zoom with polygon interpol- 
ation. Note that, while the resolution of this reconstruc- ation. This operator has a variable size footprint, and is 
tion is low, the detail is at least as good as the 5 x re- therefore capable of handling any size zoom factor, 
construction above, but without oscillations. The small 
footprint of the bilinear super-resolution operator is ad- 
equate for such low zoom factors. 



Figure 1: The effect of the zoom factor on bilinear and polygon super-resolution operators. 



Still, we would prefer not to have such free parameters at all, which leads to the simplified operator, 
— r pM 9 proposed in the next section. 



5 Interpolation operators 

By neglecting the effect of the camera point spread function, the image formation matrix, can 
be approximated as the outcome of the geometric warp and down-sampling, 

A {i) = T {i). 

Imagine that is implemented as bilinear interpolation, then this simplification introduces the 
obvious flaw that only 4 high-resolution pixels are used to calculate the value of any single low- 
resolution pixel. In reality, a low-resolution pixel may (and probably will) depend on more high- 
resolution pixels; the exact number being determined by the resolution increase, z, and the severity 
of the transformation, H. If the zoom ratio is chosen conservatively, the approximation may be 
sufficient, as illustrated in Figure 1 . 

Next, we examine the construction of such a bilinear interpolation operator. An improved polygon- 
based interpolation operator is then proposed to address its deficiencies. 



5.1 Bilinear interpolation 

The bilinear transformation/interpolation operator, A = T, has coefficients appearing on and around 
the diagonal. The coefficients in ( ) are derived from bilinear interpolation as follows: 

Suppose a function is known at four grid coordinates surrounding (x,y), namely / 00 = /( [x\ , [y\), 
foi = f([x\ , [y\ + 1), fio = f([x\ + 1, [y\) 9 and / n = f([x\ + 1, [y\ + 1). The value at f(x, y) 
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(a) With bilinear interpolation, the centre (b) With polygon-based interpolation, all high- 
of the transformed pixel (big dot) is used resolution pixels touching the transformed 
to find the nearest surrounding pixel centres pixel are used in computing the value of the 
(small dots). Only those high-resolution pixels low-resolution pixel, according to the amount 
are used to compute the value of the low- of overlap with the transformed pixel, 
resolution pixel, weighted according to their 
distance from the center of the transformed 
pixel. 

Figure 2: A comparison between the footprints of bilinear and polygon-based interpolation. Each 
figure shows a low-resolution pixel, transformed onto the high-resolution image. 



is not available, but a two-fold linear interpolation approximates it as 

f(x,y) « [ 1 -u u 



/oo 


foi 




' 1 -t ' 


ho 


hi _ 




t 



= /oo(l - u)(l -t) + / i(*)(l -u) + /i (tx)(l -t) + /n 0)0) (7) 

with u = x — \x\ and t = y — \_y\ . This is known as bilinear interpolation (even though the 
successive combination of two linear operators is no longer linear). If all known grid- values of 
f(x,y) are placed in a vector, x, then 



f(x,y) w a T x, 



(8) 



where a is a sparse vector of interpolation coefficients given by ( ) (a has mostly zero entries, except 
where elements in x correspond to /oo, foi, fio, or /n). When interpolating for several coordinate 
pairs (xi, yi) simultaneously, (8) becomes 



Ax, with b 



f(x ,y ) 
ffauyi) 

_ f{xN-UVN-l) 



(9) 



For super-resolution reconstruction, we need to warp and down-scale the high-resolution image, 
x, to produce a given low-resolution image, b^. Recall that, after registration, the homeography 
H ii0 that warps low-resolution image i onto low-resolution reference image (i = 0) is known. The 
homography that warps the high-resolution image to the i-th low-resolution image then becomes 



M i = (H i , y 1 S with S 



1/z 
1/z 
1 
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(a) Sparsity structure of the bilinear interpol- (b) Sparsity structure of the polygon interpol- 
ation operator A. Note that no row contains ation operator A. The first rows are empty, 
more than 4 non-zero values. since the corresponding pixels fall outside of 

the input image. The rows contain more coeffi- 
cients than in the case of the bilinear operator, 
indicating a larger footprint. 

Figure 3: Comparison of the structures of the bilinear and polygon image formation matrices. In 
both cases, A was constructed to rotate by 5° and to downsample by 2. Only the first few rows of 
each operator is shown. 




(a) High-resolution input image. The (b) The matrix- (c) The matrix- 
vector representation, x, is obtained vector product, Ax, vector product, 
by unpacking the values in lexico- for the bilinear Ax, for the poly- 
graphic order. operator, reshaped to gon interpolation 

form an image. operator. 



Figure 4: The effect of interpolation operators. 



Rewriting (9) as 



/(Mr^-i) 



where Cj, j = 0, . . . , N — 1 represents all coordinates in a low-resolution frame. With the coordin- 
ates M~ 1 cj known, the matrix A^ can be constructed. Figure illustrates the effect of A on 
x. 



5.2 Polygon-based interpolation 

In [16], a polygon intersection scheme, subsequently found to be similar to NASA's Drizzle 1 [6], is 
presented for image fusion. Both methods rely on intersecting quadrilaterals (four-cornered poly- 
gons that represent pixels) to determine pixel weights, but the formulation as a linear operator — as 
discussed in this paper — proves to be fundamental in its application to super-resolution imaging. 

As in the previous section, we want to express a low-resolution output image as b = Ax. Each 
pixel value, b m , depends on a number of pixels from the high-resolution image, x, weighted by the 
coefficients in row m of the operator A. The motivation behind the polygon interpolation operator 
is as follows: 

A camera sensor is a grid of photo- sensitive cells (think of them as photon buckets, each representing 
a pixel [ ]). Due to micro-lenses, the gaps between cells are negligible. During imaging, the sensor 
irradiance is integrated over each cell for the duration of exposure, after which the values are read 

*http : //stsdas . stsci . edu/multidrizzle 



6 



Algorithm 1 Calculating the coefficients of the polygon interpolation operator A. Also see Fig- 

ure 2b. 

For each low-resolution pixel, b m : 

1. Create a quadrilateral (four-node polygon) from the corner-points of b m . For example, the 
pixel at (0, 0) would correspond to the polygon with nodes 



The subscript L indicates "low-resolution" and the super-script is the pixel number. 

2. Transform the polygon to the high-resolution frame, using the transformation matrix M _1 
given in the previous section. The new corner coordinates are x™ , y™ . If any of the 
coordinates fall outside the high-resolution image, break this loop and continue to the next 
low resolution pixel (there may be other ways to handle boundary problems, but this is 
simple and works well). 

3. Determine the bounding box of the newly formed polygon: 



4. For each high-resolution pixel inside the bounding box: 

(a) Assign the pixel number n = iN + j where (i, j) is the grid position of the high- 
resolution pixel and N is the total number of columns in the high-resolution frame. 

(b) Create a quadrilateral from the corner-points of the high-resolution pixel with vertices 

and y^. 

(c) Measure the area of overlap between the polygons (x™ , y™ ) and (x^-, y#), and as- 
sign the value to A m n . 

5. Divide each row A m ^ by its sum so that the weights add to one. 



out as a matrix. Now, imagine two sensors, one with large cells (low-resolution) and the other with 
small cells (high-resolution), rotated relative to one another. How are the cell values for the different 
sensors related? Our proposed solution is to measure the overlap between the larger and smaller 
cells, as shown in Figure . The value of a (large) low-resolution cell is set to a weighted sum of 
all (small) high-resolution cells, where the weights depend on their overlap. 

Algorithm outlines the calculation of the coefficients in A. The operator is parameter free, and has 
a variable size footprint that covers all relevant high-resolution pixels (see Figure 3). Furthermore, 
it has less of a smoothing effect than the bilinear interpolator. 

Computing the coefficients of A is more expensive for polygon interpolation than for bilinear in- 
terpolation, due to the many polygon area intersection calculations (called "clipping" operations) 
involved. However, two conditions improve execution time when clipping: polygons from the 
high-resolution image are always aligned with the grid (horizontal and vertical boundaries), and 
all polygons involved are convex. 

The first observation is particularly important, since it allows the use of algorithms that clip poly- 
gons to a "viewport" (typically used to determine which part of a polygon falls inside the screen). 
The second means that a simpler class of algorithm can be employed. We use the Liang-Barsky 
algorithm [ ], which is specifically optimized for rapidly clipping convex polygons against a view- 
port. Another applicable approach is that by Maillot [13]. 

The area of the resulting non-self-intersecting clipped polygon is easily determined as given by [1]: 



i=0 

where x and y are the polygon vertices in either clock- wise or anti-clock- wise order. The clipping of 
the entire collection of pixels can easily be parallellized. Also, if only the result of the operator, Ax, 



x^ = (-0.5,0.5,0.5,-0.5) 
yT = (-0.5,-0.5,0.5,0.5). 



xbb = (Lminx^J , [maxx' L ] , [maxx' L ] , [minx^J) 

ybb = (L min yiJ > L min yiJ , r max yi,l , T max yLl) 




N-l 
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(a) Example input frame (one of thirty), upscaled by a factor of 4 using sine (Lanczos) 
interpolation. 




(b) All input frames, upsampled and averaged (stacked), after photometric registration. 




(c) Super-resolution result after photometric registration, using polygon interpolation, 
zoom factor z — 4, and A = 0.05. Note the appearance of the slanted brick-face. 



Figure 5: Super-resolution result achieved with polygon-based interpolation. 



is required, it can be rapidly rendered via a Graphical Processing Unit without explicitly calculating 
any polygon intersections manually (pixels are simply drawn as polygons and transformed; the GPU 
takes care of clipping in its fixed pipeline). Figure 4b illustrates the effect of A on a vector x. 

Note that this interpolation model is only applicable before Bayer demosaicking (a process which 
destroys much of the super-resolution information in any case). 



6 Results and conclusion 



As test data, we use 30 images and their corresponding homographic transformations to a chosen 
reference, obtained from a hand-held video of a library wall (see acknowledgements). We solve the 
regularized least-squares system in (4) using conjugate gradients (L-BFGS [ ] and LSQR [15, 14] 
give comparable results). The prior xo is chosen as the average of all low-resolution images, up- 
scaled. The high-resolution estimate x is then obtained from (5). 

The results shown in Figures 1 and 5 prove the viability of the proposed method. Software that 
implements the entire super-resolution reconstruction is provided under an open source license at 

http : / /merit at . za . net/ supreme. 

We've shown that, in the context of super-resolution imaging, an interpolation operator based on 
polygon intersection holds several advantages: it models the underlying sensor physics, it is easy to 
compute, has a variable footprint that automatically adjusts to the underlying transformation and can 
easily be expressed as a linear operator. The method can also be extended to model pixels as higher 
order polygons, thereby allowing for non-linear image transformations between input images (such 
as those introduced by lens distortions). 

More detail on the various aspects of super-resolution imaging presented here is given in [17]. 
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